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Abstract 

In this paper we provide an alternative approach to the works of the 
physicists S. Cocco and R. Monasson about a model of DNA molecules. 
The aim is to predict the sequence of bases by mechanical stimulations. 
The model described by the physicists is a stopped birth and death pro- 
cess with unknown transition probabilities. We consider two models, 
a discrete in time and a continuous in time, as general as possible. We 
show that explicit formula can be obtained for the probability to be 
wrong for a given estimator, and apply it to evaluate the quality of the 
prediction. Also we add some generalizations comparing to the initial 
model allowing us to answer some questions asked by the physicists. 



1 Introduction 

1.1 The physical approach 

In this introduction we first summarize some ideas and results of the works of 
V. Baldazzi, S. Cocco, E. Marinari and R. Monasson ([^, and S. Cocco 
and R. Monasson [7j who are interested in a method for DNA molecules 
sequencing. They study a mechanical way, described below, instead of tradi- 
tional bio-chemical or gel electrophoresis technics. The experiments for me- 
chanical unzipping were first realized by Bockelmann, Helsot and coworkers 
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[5] and [6]. The principle is based on the fact that the hnk strength between 
two bases of a given pair depends on whether it is a C = G or a yl — T (see 
Figure [1]) . Indeed the hnk A — T is weaker for biochemical reasons than the 
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Figure 1: X, = 5, bi : C = G, b2 : A - T 

link C = G. Moreover, there are also some stacking effects between adja- 
cent bases, that is to say, the force needed to break, for example, the link 
C = G is different if the C is following by a ^, or a T. This last factor is not 
negligible (see the table below) and therefore must be taken into account if 
we want the model to be as sharp as possible. 

We now give a brief description of the experiment (for more details see 
[3]), the extremities of the DNA molecule are stretched apart under a force 
/. The force / is chosen large enough in such way that the molecule can 
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Figure 2: Binding free energies (units of ksT) 
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be totally unzipped. However / is also not too strong so that naturally the 
molecule rebuilds itself. Though there is back and forth movement of the 
number of open pair bases, this back and force movement generates a signal 
which can be measured by biologists. This signal can be modeled by a birth 
and death process with unknown transition probabilities. 



1.2 The model 

We denote by M the length of the DNA chain and by (61, 62, • • • , ^a/) the 
sequence of bases of one of the strand of the molecule. So hi is the i*^ base 
which can be either a A, a T, a C or a G and the corresponding base of the 
other strand can be deduced. We consider both a discrete and a continuous 
time-sequence of the number of open base pairs, the first one is denoted X, 
the second one Y . We now make the link between X (and Y) and h. For 
this, we define the free energy g of the molecule when the first x base pairs 
are open: 

X 

9{x) ■■= ^go{bi,bi+i) -xgi{f). 

i=l 

There are two different parts: first, 5(0(64,61+1) is the binding energy of the 
pair i. Note that stacking effects are taken into account: go depends on the 
base content bi and on the next pair 6j+i. The second contribution gi{f) is 
the work to stretch under a force / the open part of the two strands when 
one more base pair is opened, in particular gi increases when / does. Note 
that gi is known, whereas Yli=i 50(64, 6j+i) is unknown as we are looking for 
the 6i's, in fact we assume that the sequence of bases is random. A typical 
trajectory of g, obtained by numerical simulations, is given in [1] page 7 and 
looks like Figure [3l 

The number of open pairs fiuctuates randomly with a distribution di- 
rectly connected to the difference of the free energy g between two consecu- 
tive base pairs. Therefore it can be represented by a random walk in random 
environment: 



The discrete case is defined as follows, assume that the random sequence 
6 := {bx, 1 < X < M) is fixed, then the transition probabilities of the 
number of open pairs are given by: for all 1 < x < M — 1, 

where /3 is a constant parameter which is proportional to the inverse of the 
temperature. Also we assume that the first base of the molecule is always 
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g(i), 0<i<M 
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Figure 3: A typical trajectory of g, M = 500 



open which means that pi = 1. Note that the larger is / the greater is the 
probability to open a new pair. We easily get a simple expression for this 
probability which is 

P(X.+i = x + 1\X, =x,b) = — \ ^, (2) 

where we denote 



Ag{bx,bx+i) ■■= go{bx,bx+i) - giif). (3) 

Formula ([2]) shows that we only need local information on the sequence b 
to get the transition probability at site x + 1, and that X can only move 
forward with probability px or backward with probability l—px- We discuss 
about some results on this well known model in the next section. A typical 
trajectory of X, obtained by numerical simulations, looks like Figure 4. 



For the continuous time model, the physicists also take into account the time 
it takes X to go from a site to another. Thus we introduce a second time 
continuous model Y. Given the go, when Y is at the site x, it jumps in 
X + 1 with rate re~l^^°^^'"'''^+^^ and in x — 1 with rate re~^^''^^^^ where r is a 
constant which value depends on biological parameters. That is, given the 
DNA sequence 6, y is a Markov process with finite state space {1, . . . , M} 
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killed when it hits M whose transition rates are for a; > 2, 



p{x,y) 



and for x = 1, 



^g-/39i(/) 

_j.^g-/3so(fea:,6a; + l) _|_ g-/3ai(/)^ 





if y = X + 1, 
if y = X — 1, 
if y = X, 
otherwise, 







if y = 2, 
ify = l, 
otherwise. 



The process Y can be represented as the couple {X, T) where X is the 
sequence of discrete jumps and has the same law as in ([1]) and T is the 
sequence of successive times spent in each site between two jumps. 
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Figure 4: A typical trajectory of the number of unzipping pairs, M = 500. 



Moreover, all along the paper, we assume that qq is injective on the first 
and second variables : i.e. for all a G {A, T, C, G} the functions 

9oia,-) - I ^ 9o{a,l) and go{.,a) : ^ 9oi7,a) (4) 

are injective. Note that this hypothesis matches with the experimental val- 
ues of the energy (see Figure [2]) . 

We describe now briefly some results obtained by the physicists in the 
continuous time case. 
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1.3 Some results obtained by the physicists 

In their papers [3], [3] and [7], they assume first that there is no stacking 
effect, considering that go at site X is only a function of bx and that i^Qoibx) 
is a sequence of independent and identically distributed (i.i.d.) random 
variables. In this case they compute the maximum likelihood estimator 
for hx- For a better accuracy they consider several total unzipping instead 
of a single one, that is to say they look at a sequence of R independent 
trajectories {Y^'-\ I < R). In a second step they study the decreasing of the 
probability that this estimator gives a base sequence, and they show that 
this probability decreases exponentially; for all x < M, 

+ bx) < eM-R/Rc{x)). 

The constant Rc is estimated numerically. For the general case (with stack- 
ing effects) they use Viterbi algorithm to compute the maximum of 
likelihood. Then they estimate the probability to be wrong with this es- 
timator by using both analytic and numerical methods, they get a similar 
result than for the independent case. 

After some discussions with S. Cocco and R. Monasson some questions rise: 
is it possible to get a general and rigorous method which can be applied to 
all these cases ? how the choice of the force can be used in order to improve 
the results ? and what is the difference between the discrete and continuous 
time model ? We study all these questions in the present paper. 

1.4 A mathematical point of view 

First we would like to recall some basic facts for the discrete time model. 
If we forget, for the moment, that the state space is finite, {X^, A: G N) 
is a random walk on a random environment on Z as Solomon defined it in 
[To] . We know, for example, that for i.i.d. sequence {goibx),x), if go{bi) 
has mean zero and gi = 0, then X is almost surely recurrent, it is transient 
on the other case. For the recurrent case, X is a Sinai's walk for the 
transient one, the first study is due to H. Kesten, M.V. Kozlov, F. Spizer 
[8]. Here we are interested on what a trajectory of the walk can say about 
the environment, this aspect has not been studied a lot, there is a paper of 
O. Adelman, N. Enriquez and for the special case of Sinai's walk a paper 
of P. Andreoletti [2]. More precisely [2] shows that g{x) can be estimated 
from a single trajectory of the walk by studying the asymptotics (in time) 
of the local time at site x, which is the amount of time the walk spends at 
this site. However this approach can not be used to give informations on a 
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particular site, typically on go{bx,bx+i) for a given x. 

To move from Solomon walks to the problem asked by the physicists we 
have to make a sacrifice, more especially we are no longer interested in 
asymptotics in time. Indeed if the time goes to infinity that means that 
either we have to wait a very long time to reach the end of the molecule, or 
once it is totally unzipped it can move back to the beginning. This last case 
is not possible because when the end of the molecule is reached then the two 
separate strands can not reform the molecule properly. In compensation, we 
only have to study the processes X or Y until they reach M, that is until 
time 

TM = mi{k>0,Xk =M}. 

So we are interested in the discrete time process {X^, k < tm) and the 
continuous one Y = {Xk,Tk, k < tm)- Note also that M is the length 
of the DNA molecule, in term of the number of pairs, which can be big 
but finite. The other good news is the fact that the DNA molecule can 
be unzipped a large number of times, we have called, this number R, and 
we will be looking at asymptotics in this variable. Finally we are looking 
at R independent trajectories denoted {zf^ , ^ < I < R, < ti < t^) of 

random walks on a same unknown environment b with the first time 
the walk I hits M (Z is either X or Y = {X,T)). Also we will see that 
even if we assume that the go are random, its distribution will not play an 
important role in our setting, essentially for two reasons the first one is the 
fact that the state space is finite and the second one is that we are looking at 
asymptotics in R. The method is based on the fact that, given the trajectory 
of a random walk (or R random walks) on an environment b, the probability 
that a given estimator b gives a good sequence (typically P(6 = 6)) depends 
only on elementary functions of the trajectory of this random walk. 
For the discrete time model, the important quantities are the number of 
times X goes from x to x + 1 or to x — 1, with x E [1, M — 1]: 

fc=0 k=0 
R R 

1=1 1=1 

For the continuous time model, we have also to consider the total time spent 
in each site until the instant (which is as for the discrete case the hitting 
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time of M for the processes X^^'^): for any x G [1, M — 1], 

(0 

Ui R 

We will denote by {Y^ in the continuous case) the a-field generated 
by the trajectories of the R independent random walks killed when they 
hit the coordinate M. P denotes the probability distribution of the whole 
system, whereas is the probability distribution of the walk for a given 
sequence of nucleotides a. Also (resp. Var" for the variance) is the 
expectation associated to P". 

In Section [21 we start by the estimation base by base, we define the informa- 
tion at site X for both cases and show that the expression of the probability 
to get a given base at a site x conditionally on the trajectories are a simple 
function of the information. Then we study the asymptotic (in R) of the 
probability that the maximum likelihood estimator gives a wrong base, we 
define and study a typical number of unzipping Rc which measures the qual- 
ity of our prediction. In a second time we are interested in the estimation of 
the whole molecule, we start with a general expression of the probability to 
get a specific sequence given the trajectories of R random walks. We show 
that the global maximum likelihood estimator converges. Then we study 
the probability to make at least one mistake and then h separate mistakes 
by considering this estimator. We focus on the continuous case, and just 
quote the differences with the discrete case. 

In Section [3l we study some possible improvements. The first one consists 
on a local modification of the force in order to trap the system in a specific 
region. It has a direct effect on the time spent in this region and therefore on 
the quality of the prediction. For the second one we also modify the force, 
it is now function of the binding energies, and also of the space. It allows a 
fast unzipping till the bases we are interested, and a fast decreasing of the 
probability to be wrong. 

2 Bayes estimator, asymptotics in R and typical 
number of needed unzipping R(. 

Most of the results of this section are based on the fact that we can compute 
easily the joint distribution {Lt''^^\Lx '^^^ = L^'Si — 1), in fact it is not more 
difficult to get the joint distribution L+ := 1 < x < M — 2) and as 
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we have not found it in the hterature, we first prove the following lemma 
for one random walk : 

Lemma 2.1. // we denote k = {kx , x G {1, . . . , M — 1}) with kM-i = 1; 
then 

(L+ = k) = n' ^/rr \ta-p^)'-'~'- 

x=2 \ ^ / 

In particular, for x G {2, • • • , M — 1}, 

(L+ = kx, L- = kx-i) = P'' (L+ = kx, L+_, = kx-i + 1) 
kx + kx-, - l\ (1 (5) 



kx 1 

where for simplicity we denote := := Lx'^^^ and 

11 . M-l 

with Tx := infj/c > 0, X^, = x}. is then easy to compute the following 
means and variances 



Px Px ^ / rp.. 



Yar'' = ^ - l\ and War'' (^i^) 



2-^- (7) 



r^P; 



X 



Proof. Formula ([5]) of the lemma can easily be obtained by using the Markov 
property of X for a given sequence 5, the mean and the variance of and 
S^^ are direct consequences. Therefore we just prove the expression of the 
joint distribution of . Define for n > 1, the event 

M-l 



:= fl {L+ = kx} 



where kM-i = 1 (there is always only one jump from M — 1 to M). Then, 
P^ (L+ = k) =F^ (Ai) = (L+ = ki\A2) P^ (^2) 



{1+ = ki\L+ = k2) P^ {A2 
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where the second equahty comes from the Markov property of the walk X 
given h. Formula ([5]) implies for any xG{2,-- - ,M — 1}, 



thus, 

and we get the result of Lemma 12.11 recursively. □ 
2.1 Prediction site by site 

In this section we always assume that f is constant. Let us begin with 
a general proposition true for the continuous and the discrete time cases, 
then we discuss the differences between the two cases. First we define the 
following function ix, called local information at site x of the system, it 
differs for the two cases. Let x G {2, • • • ,M — 1} and (a^-i, a^, a^+i) G 
{A,T,C,G}^. 

For the discrete case, the information is defined by 

ix{o!.x-i,ax, ax+i) ■= 
L+'-^log(l + e'^^^'("-"-+i) + L~'-^log(l + e-/^^9(°-"-+i)) 
+L+l-^log(l + g/^^f + L;i^log(l + e-/3^f 

and for the continuous case, by 

ix{ax-i,ax,ax+i) := /S^oK, + re-'^fo^^-^^+i) 
+ /3go{ax-i,ax)LtLl + ^f.^re-^^oK-i-a.). 

We are now ready to state the 

Proposition 2.2. For all x £ {2,--- ,M - 1}, and for ax G {A, T, C, G}, 

denoting = {hi, 1)2, ■■ ■ ,hx-i,bx^+i,- ■ ■ ,bM-i), we have 

p(;,. = „.|z«.r) = ^'"'f<-;-'°-'_''>' (8) 

Ea, exp(-/^(a^,6)) 

where 

Ix{u,b) = Ix{u,b){Z^) := ix{bx-i,u,bx+i) - logP(6^ = u\b''). 
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and is either for the discrete case or for the continuous one. 
The maximum likelihood estimator for b^, is given by: 

bx= ^ axl{/4a,,b)=min<i, (9) 

a^&{A,T,C,G} 

Assume that Hypothesis ^ is satisfied then the maximum likelihood estima- 
tor converges almost surely to b^- Moreover, 

lim -ilogPfS, /6,|Z^,P) = l/i?,(x)>0; (10) 

Rc{x) is called the typical number of random walks at site x. For the discrete 
case, f -almost surely, 

1 _ AG-(6,_i) ^ AG+(6,.+i) 



Rc{x) p^-i Px 

and for R large enough 

AG-(6,_i)^ + AG+(6,+i)^ + e,{R), 



Rc{x) ^ ^ R ' R 

where l!^G^ {bx+i) and /S.G^ {bx-i) are two positive numbers (see ill]) )- For 
the continuous case we get the same expression but replacing the constant 
AG+{bx+i) (respectively AG+{bx-i)) byAF+{bx+i) (respectively AF+ (bx-i) ) , 
see also their expression in il8\) . Also for the discrete and continuous case 



we have e^iR) ~ {RloglogRy/'^Lt'^^^ /R. We denote a{R) « d{R) if there 
exists a positive bounded number c such that a{R) = c* d{R). 

We first prove the result and then discuss about the expression of Rdx). 

Proof. We only give a proof in the discrete case. Formula ([8]) is a simple 
consequence of Bayes formula together with Lemma 12.11 and the expression 
of bx follows. By the strong law of large number (LLN), P^-almost surely, 

lim \rlx{ax,b) 

= f log(l + e^^9{a^,b^+i)\ _|_ g/3Ag(6:i,,6^+i) ^Qg^ _|_ ^-l^^9{ax,b^+i)\\ _|_ 
Px^ ^ 
^ f log(l + ^l3Ag{b^-i,a^)\^ _|_ ^l5Ag{b^-i,b^) {^g^^l _|_ g-/3Ag(b^_i,a^)-j\ _ 
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Recall that Ag is defined in ([3]). As the function 

X —7- log(l + x) + clog(l + 1/x) 

is minimal iS x = c, asymptotically the right-hand side of the previous 
equality is minimal iff Ag{bx-i,a:j:) = Ag{bx-i,bx) and Ag{ax,bx+i) = 
Ag{bx,bx+i), therefore with Hypothesis (jl]), = bx and the estimator bx 
is almost surely convergent. 

Now we are interested in the difference lim/j_!.+oo ^{Ix{bx,b) — Ix{c(x, b)), 
first let us define the function 



notice that Ga{x) is positive for all x a and Ga{a) = G'^{a) = 0. almost 
surely for all Ox 7^ bx 

lim ^{Ix(bx, b) - Ix{ax,b)) 

= -^G'Ag(fe,,6,+i) {Ag{ax,bx+i)) - ^G'Ag(6,_,,6,) {Ag{bx-i,ax)) , 

Px Px—1 

By Hypothesis ([3]) of local injectivity of go, we get that P^-almost surely, 
for ah oix / bx, limij^+oo ^{Ix{bx,b) - Ix{ax,b)) is strictly negative. Also 
notice that 

^ ^ glx(ax,b)—lxiax,b) _ ^Ixiax ,b)—Ixibx ,b) 'y ^ ^Ixibx ,b)—Ix(ax ,b) 

Oixi^O-x ax^O-x 



_ glx{ctx,b)-lx{bx,b) j 2 _|_ glx{bx,b)-lx{ax,b) 
\ axT^oix,bx 

therefore we have that P^-almost surely for R large enough 

g]^GA9{i,,,6,+i)(AsK,b.+i)) + p^GA,(i,,_i,6,){Ag(fe,_i,a,))-e4i?) 

< ^ exp(/a;(aa;,6) - /a;(Q;a;,6)) 

< 4g?^GA,(6,.5,+i){A3(a,,fe,+i)) + -^GA,{6,_i,6,)(A<;(6.-i,a.))+e.(R)^ 

where ex{R) is the error we make by using the LLN and by the presence of 
the logP(6a; = u\b^) in the expression of /, we examine this term at the end 
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of the proof. Now define 

AG+(6^+i) := max (GAg(fe,,6,+i)(A5r(a^, 6^+1)), (11) 
AG"(6^_i) := max(GAg(fe^_i,f,,)(A5f(6^_i,Qa;)), 

and finally notice that ¥ [bx ^ bx\Z^, ) can be written like 



6,. b^ 



CtxJ^bx 

\ ^+ J2 exp(4(a,,6) -4(q,,6)) , (12) 

ax^bx \ oixi-oix I 



we get that almost surely for R large enough 



log 1 + 46*^^-1 



^AG-(fe,_i)+^AG+(6,+i)+e,{fl) 



<logP(6,/6,|Z« 6") 



< log 4 — log 1 + e^^^-i 



that can be written like: 

logp(6,. 6-) ^ 1 



R Rc{x 



ex{R) + const 
- R ' 



where "const" is a constant real number. 

To finish the proof we have to study ex{R)- By the iterated loga- 
rithm law (ILL) ex{R) w {Rloglog RY/^{{Nar^Lt'^^^Y/^ + (Var^'L^'^^V^^) 
-logP(6^ = w|5^) where (Var*(L+))^/2 as well as (Var''(L-))V2 behaves 
like l/px (see Lemma [2.ip . Then ex{R) ~ y/Rlog log R/px-, this gives the 
expression of the proposition by using the LLN. Also to move from the result 
under the measure P** to the result under P we just notice that what we get 
is true for all sequences b. 

□ 

Notice that l/i?c {x) is the rate function in the large deviation theory so 
the above proposition gives informations on the decrease of the probability 
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to be wrong. 

The discrete case. We have obtained that P-almost surely for R large enough 

Rc{x) Px-l Px 

first note that we want AG~{bx-i) and AG'^{bx+i) as large as possible, 
unfortunately they may be very small, indeed 



Ga{u) = p\n-af^ + o{u-af 



thus when the correct energy go{bx-i,bx) (take for example 1,78 in the table 
of energies Figure [2]) is close to another one (take 1,55) we have 

AG-(6^_i) « /32 min {Ag{bx-i,ax) - Ag{bx-Mf 

= 0^ min {gQ{bx-i,ax) - go{bx-i,bx)f , 

Ctx^bx 

so AG~ (bx-i) and AG~{bx-i) can be small. However, this is not the only 
and worst case. Indeed, assume n < and a < 0, then for large /3, 

Gain) « exp(/3a)(exp(/3(ii - a)) - 1 - f3{u - a)), (13) 

which exponentially decreases with /3. This situation may appear when / is 
large and the binding energy at site x of the molecule is weak. We will see 
in Section [3] a method to avoid this situation. 
Turning back to the expression of 1/Rc{x), we also notice that 

^>exp(/3( , max (g{l) - g{x)))) 

Px x+l<l<M-l 

= exp(/3M,), (14) 

with Mx := maxx+i<i<M-i^Yli=x+i9o{bk,bk+i) - {I - x)gi{f)y So, as 
expected, the convergence is better if there are obstacles in the path from x 
to M. Finally, we have P-almost surely 

l/Rc{x) > exp (/3M,_i) AG- + exp (/3M,) AG+, (15) 

with 

AG" := min{ AG" (7), 7 e {A, T, G, G}} 

and 

AG+ :=min{AG+(7),7G{^,r,G,G}}. 
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Formula useful for the estimation. As we have seen above, Rc{x) character- 
izes locally the environment. However, what is really important to control 
the quality of estimation at a point x is not the number of walks R but the 
total number of passages at this point, := + L~ . So we define the 
typical number of visits at site x, Ldx) by 

-logPfb, 
lim — 

and we get P- almost surely 



1/Le(x):= lim ^—-^ ^, (16) 



^ >^(ag+aag- 



Lc{x) - 2 

Total amount of time to reach M . An other important factor is the time 
required to unzip totally R times the DNA molecule. It should not be too 
large. This time is given by: 

R R M-2 

1=1 1=1 x=l 

And by the LLN, P almost surely 

M-2 . 

i?exp(/3maxA/4) < [t§j] =rS2(- + — - 1 

^ \Px-l Px 

<RM exp (/3 max ) . 

X 

So, as seen in the previous paragraph (see p4|) ). large (3 can lead to a better 
prediction, however it slows down the system. Of course it is worse if there 
is large obstacles between x and M because in this case Mx is large too. 

The continuous time case. Like for the discrete case we first define a function 
F : M M+ by 

F{u) = e^'' -I- and 
AF~(7) = min (F(5o(7, "") - 5o(7, v)),u, v G {A, T, C, G},u / v) , 
AF+(7) = min {F{go{u, 7) - 9o{v, 7)), n, v G {A, T, C, G},u / v) , (18) 

AF- = min {AF' (7) , 7 G T, C,G}) , 

AF+ = min (AF+(7), 7 G {A, T, C, G}) . 
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Then a study, similar to the discrete case, leads to, P almost surely 

AF+ AF~ 1 
1/Rc(x) > + and 1/LJx) > -(AF+ A AF"). 

Px Px-l 2 

Note that the bad case observed for the discrete time model (see equation 
(fT3]l ) does not appear here, however when (u — a) is small, F{u — a) is as 
Ga{u) of the order of (a — u)^. 

In the next section we look at the entire molecule, we define global 
information and study the decreasing of the probability to make a mistake 
by using the global maximum likelihood estimator. 

2.2 Inferring the whole molecule 

Define the global information I of the whole molecule, let a G {A, T, C, G}*^, 
for the discrete case , 

I{a) := -logP(6 = q)+ 

^ Lp^ log(l + e/^^5(«.,".+i)) + L~,R iog(i + e-/3A3K,a,+i))_ 

x=l 

and for the continuous case = {X^,T^), 

M-l 

/(a) = -logP(6 = a)+ ^ /35o(ax,ax+i)L+'^ + re-'^^°("-"-+^)5i^ (19) 

x=l 

The global maximum likelihood estimator converges to b : 
Theorem 2.3. For any a G {A,T,C,G}^^ with ai = bi, we have: 

F{b = a\Z^, h) = ^f^. (20) 

The global maximum likelihood estimator is the element b of {A, T, C, G}*^ 
which minimizes the function I. 

Assume that ^ is satisfied then the global maximum likelihood estimator 
converges almost surely to the DNA chain b. 

Proof. We only give the proof for the continuous case, the discrete one uses 
the same ideas. For a realization of = {X^'^\--- , X^^\ T^^\--- , T^-^)), 
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Bayes Lemma gives : 



Notice that we still use P to denote a probability density. When = xf^ , 
and the environment a are given, T^^'^ is an exponential variable, independent 

of the other T*-'^)), and of parameter r j e ^ e~^^^^^' 

if xf G {2, • • • , M - 1} or re-^^o^^i'^^^) if ^.W ^ j rpj^^g^ 



1^ R 



b = a 



(0 1 

R 



(0 3.(0. 



/ 1=1 i=l 

M-l 



a;=2 



(0 1 

R 



(0 1 



where Zf = ^ 1 and = E E ^J)! 



;=i fc=i 



1=1 k=l 
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Moreover 
/ -R 



\/=i 

M-l / 

n 

x=2 \ 



b = aj 

g-/9go(ai^,Oa; + l) 



j + ,R 



l(/) 



-l3go(ax,ax+i) _|_ g-/3gi(/) 



-l3go{ax,ax+i) _|_ ^-l^giif) 



TT ^ ^ 

x=2 (e~^3o(a^,a^+i) _|_ e~^3i(/)y^ 



where 



(0 



(0 • (0 • , 1 • 



i=l k=l 

Then we have the following equality 



M-l 

n 



^««g-s«r(e-'3«o(-^.".+i)+e-/39i(/))-«+.«/3goK,a.+i)-(/+l^-l)/3si{/) 



a;=2 



X ^;fg-sfre-S9o(-i."2)_i+-«,3go(„i,„2)^ 

Assembling the different expressions we get the formula for P(6 = a\Y^). 

We now prove the convergence of the maximum likelihood estimator. 
According to Lemma [2. 11 the LLN and the LIL, for any x G {1, • • • , M — 1}, 

almost surely for R large enough 

Lt^R = -4^ + e^{R) and = ^^^-—r, + ex(i?). 



Px{h) 



rpx{b) 



Then for any a E {A,T,C,G}^\ P-almost surely the information I{a) is 
equivalent to 



M-l 



I{a) = -R ^ (/35o(ax, Ox+i) + e" 



■^(50(o^^,Oa: + l)-50(fea;,6^: + l))^ 



1 



x=l 



Px{b) 



+ tx{R)- 
(21) 
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As for any real number c, the function x — )■ x + e ^'^'^ is minimal iff x = c, 
the sum is minimal if and only if for each xG {!,••• ,M — 1}, 

go{ax-,OLx+i) = go{bx,bx+i)- 

So by Hypothesis dH) and the equality ai = 61, P almost surely for R large 
enough, I{a) is minimal iS a = b. □ 

2.3 Control of the estimation for the continuous time case 

In this part, we show that the probability to make at least one mistake using 
the global estimator b decreases exponentially. 

Corollary 2.4. Let rie be the number of wrong predictions, then V-almost 
surely, 

R-^+oo R 

Proof. From the first part of Theorem 12.31 we know that 

1 



(ne > l\Z^,bi) = l-^[b = b\Z^,bi) = 1 



1 + V -e-(m-m) 



The second part of Theorem 12.31 together with Equation (|21|) give that 
almost surely for R large enough 



M-l 



I{a) - I{b) = Lp^F{go(bx, k+i) - 5o(ax, a^+i)) + e^iR), 



x=l 



recall that, for n £ M, F{u) = e^^ — 1— fiu and ex{R) is defined in Proposition 
12.21 Notice that we also have 

I{a) -I{b) = V —^F{gQ{bx,bx+i) - fi-oCax, ax+i)) + ex(^) 

M-l 

> ^ XI F{go(bx,bx+i) - go{ax,ax+i)) + ex{R)- 

x=l 

For a ^ b, denote by ay the first site such that Uy / by, obviously y > 2 
and therefore, 

I{a) - I{b) >R-F (go{by-i,by) - go{by_i,ay)^ + 0(V^loglogi?) 
> R ■ AF- + Oi^RloglogR) 
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so finally we get 

^e-U(a)-^W) < ^M^-R-AF-+OiVmoglogR) 

which concludes the proof. □ 

Now let us define fie the number of non successive errors; by non succes- 
sive, we mean that two errors are separated by at least one good prediction. 
The probability to make more than h non successive errors exponentially 
decreases: 

Corollary 2.5. Let /i G N, '^'-almost surely, 

R-^-\-oo R 
Proof. Like in the previous section we easily compute 

P (fie > h\Z^, bi) = ¥{be Ah\Z^, bi) 



1 



- e 



-/(a) ■ 



-/(a) 



where A}i is the set of the chains a which are different from b in at least h 
non successive sites and An is the complementary set. As 6 G An-, 



{he > h\Z^,bi) < 1 + 



thus, as before, we just have to study the quantities 1(a) — I{b) where 
a e Ah- 

Here the only important contribution for a given chain a comes from the 
sites y such that ay is different from by but a^-i = fey-i- As every chain of 
Ah has at least h such points, we obtain in the same way as before, 

^ g-(7(a)-7(6)) < 4Mg-R/iAF-+0(V«log log R) ^ 
ceeAh 

it is then easy to obtain the result of the corollary. □ 
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This corollary shows that we have very few chances to make several 
mistakes at distant bases of the molecule, however notice that we can not 
replace he by rig, indeed the probability to be wrong at h successive sites 
does not decrease exponentially in h. We actually note that if we get a 
mistake in one site then there is a great probability to make mistakes on the 
following bases. 

The discrete time case leads to very similar results, in fact the main 
difference is that AF~ is replaced by AG~ . 



3 Possible improvements of the method 

In this paragraph we use the results of the previous sections to propose two 
simple extensions which improve the prediction. In both of them the idea 
is to adapt the force / to the context. 



3.1 Forces depending on the coordinate of a site 

In this section, we mainly discuss about ^ which appears in the expression 
of 1/Rc{x). As seen before, ^ can be large but it depends on the sequence 
go and the force at site x. We recall that 

M-i / r I ^\ 



1 

Px 




-/3 <{l - x)gi{f) - ^ go{bk,bk+i) 

I k=x+l 



For example when the force / is not too large, some valleys, that is to say 
portions of the sequence b such that '}2\=x+i9^i^k-,bk+i) — — x)9i{f) is 
large for a given /, can appear. So the quality of the prediction is good only 
in some specific regions of the molecule (the decrease of the probability to 
be wrong behaves like g-constije''^^^ ^ where Mx is given in (|14p ). 

When the above condition does not appear, the force can be modified in 
order to slow down locally the syste Assume that we are interested in a 
specific region centered at the coordinate y, [y — A^y + A\ for some ^ > 0, 
where the {L^j^^ + L~j^^,x E [— ^, ^]) or the {Syj^x, x G [— j4,j4]) are small. 
Then we can take for x G [—A, 

gi{fy+x) = C{A-x), (22) 



^According to the physicists, it is possible. 
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for some small constant C > 0, we get: 



1 



> exp -/3 < 



Py+x 

especially for x = 0, 



A+y 



— {A-x){A-x-l)- ^ go(bk,bk+i) 

k=y+x+l 



1 

Py 



C 



> exp -/3 <^ -A{A - 1 



A+y 

^ go{h,bk 

k=y+l 




Then if ]K{gQ{bk, bk+i)) > C{A — 1) and A is large enough, i will be quite 
large too. Once again that will work if the region we are looking at is quite 
far from the end of the molecule, that is to say, A is large. On the other 
case what could be a good idea is to unzip the molecule from the end. We 
now move to another possible improvement. 



3.2 The energy point of view: forces depending on the val- 
ues of the environment 

In this paragraph we do not try to find directly the sequence of bases but the 
associated binding energies. We denote go{x) for go{bx, bx+i) and we assume 
that there are K distinct values for go{x), typically for a DNA molecule 
they are given by Table [2j Note that the random variables {go{x),x) are 
not independent, and that the dependence is also given by Table [21 For 
example, the energy 1.06 can only be followed by 1.78, 1.55, 2.52 or 2.22. 
We will keep the notation when we work at fixed energy. To simplify the 
computations we also assume that the sequences go are equiprobable. 

First let us introduce some new notations. We will denote by ^i, /i2, • • • , tJ-K, 
the possible values of goi-), ordered in such a way that fii > /ij+i for 
all i. We also assume that the force / can take K + 1 decreasing values 
{/i) • ■ ■ ; Jk-i, Ik, Ik+i = 0} such that gi{.) takes K + 1 distinct values 
denoted {ri, r2, • • • , rK-i,rK,rK+i = 0} and satisfying 

ri> r2> ■■■ > rx^i, 

Hi-ri < 0, /ii - r2 > 0, yi > 1 fii - r2 < 0, 
H2-r2 < 0, /U2 - rs > 0, Vz > 2 - rs < 0, 



fJ-K -rK < 0, fiR - rR+i = fJ-K > 0- 
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Let us define ql^ := (1 + e^'^'^"'~^*^)~^ . This is the probabiHty to go 
on the right if the force fi is apphed and if the value of the environment 
is equal to ^m- Notice that if /i is applied then for all x < M, px ■= 
(1 + exp(/3(3o(x) - ri))r^ > ql > 1/2. We denote Ti := {x < M,go{x) = 
/Lti}. Then if /2 is applied, for all a; < M, x ^ Vi, > q^- We therefore 
denote Fj := {x < M, go{x) = Hi} for i G {2, . . . , K} and we get a partition 
{ri,r2,-- - of {!,••• ,M}. The idea is then to consider a certain 

number of random walks for each values taken by the force. We denote by 
Rj the number of random walks we consider for the force with value fj. 
Prom now on we will only focus on the discrete time case, indeed it is the 
one where the gain is the most important, however what we suggest can be 
applied to the continuous time model as well. We introduce the information 
at site X when the force fj is applied: 

4(m) := Lp^' log (l + e'^^'^'"-''^)) + L~'^' log (l + e-'^^^-"''^)) (23) 

and the relative information at site x 

4(m,Z):=4(m)-4(Z). (24) 

We also define the function i^a : M — >■ M+, 

Ha{u) := log(l + e^«) + exp(/3a) log(l + e"^"). 

Proposition 3.1. Let k < K, assume that the forces fk and then Jk+i o.re 
applied (everywhere) then, for any x, any sequence and any estimator 
9o{x), 

Let us define the following estimator: 



= 1 + 



go{x)=fl r -.flfe+i 1, (25) 



-fU>0- ^<l, 7TT7^>1 ' 



then F-almost surely, 



1 



R^{x) ■ Rk=Rk+i=R-^oo R 

Px vV'^ 



lim_ ilogP (3o(x) = ilk. mix) ^ fo) 
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where 



H^^^ := min H^^^rM " ^fc) " H^^-r^{fik - rk), 

(G|k— 1,/c+l} 

H^^+^^ := min - r^+i) - - r^+i), and 

ig{fe— i,fe+i} 



1 1 



^ exp 9o{y)-ri +1. (26) 



We first give a short proof of the result and then discuss about the 
improvement. 

Proof. The first part of the proposition is, hke before, easily deduced from 
Bayes formula. Thanks to Lemma l2. II and the LLN, P''-almost surely 

and in the same way P*-almost surely 

lim inf < > 0, -^--^ < 1, — > 1 } 



X 



= inf {k > 0, 5o(a;) - ^fc < 0, go{x) - r^+i > 0} . (27) 

This implies the P^-almost sure convergence of {go{x) = fij.} to the event 
{go{x) = /ifc}. Therefore as Ha{.) gets its minimum in a, ¥^ almost surely 
on {go{x) = /Ufc} for all m ^ k 



i^{m, k) _ I 



A similar analysis can be done for i^'^^{m,k) so F^-almost surely 



{m,k) _ 1 rjik+i), 



lim ^ ^ ' ^ =: ^Am^^^^(m). 
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We get that P^-almost surely for = Rk+i = R large enough 
logP(5o(x) = Mfc, 50(2;) 5o) 
<log( jz -p(-|A<)M-^A<+^)M + o(i?)) 

The o{Rk) is the negligible term that comes from the ILL (see the end of 
the proof of Proposition \2.2\i which is of order of ^/R\oglogR. This gives 
the desire result by dividing by R. □ 



Here we avoid a bad situation seen in the first section (see ()13p ): for 
large /3 

F('=+^)«/3exp(/3(^fc-rfc+i)) 

which exponentially increases with /3. However we have to be careful with 
this method. In order to catch the small values of the energy, should be 
small and may slow down the system (see (jl7p and Lemma l2.ip . indeed we 
have 

x=l Px-1 J x=l Px^l ) 

>i?exp(/3maxM|+^), 

X 

and M^+i := max3,<,<M-2 50 (^«, - ?'fc+i} is large if k is close 

to K. 

An alternative approach is first to apply a large force /i from to x — 1 in 
order to reach quickly the region we are interested in, then to apply all the 
forces in x and then, after rc + 1 to apply a small force (for example /x) 
in order to slow down the system and stay focus on x. More precisely / 
depends on the energie as before but it also depends on the site: 

U{A = /lll<z<x'-l + fi^z=x + /a'1z>x+1- 
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We get the following P-almost sure result 
1 ,.1 



lim 



Rc{x) Rj=R-^+oo, \/l<j<K R 



log¥{go{x)y^go{x)\{X''',i< K),g^) 



(28) 



K<K — 1 <e{fe— 1,A;+1} 

k<K—l te|fc— l,fe+l| 

The main interest in the above result comparing to the previous one is the 
fact that 4r is large but the time to reach x is small. Indeed 



Rx X 



^1 + exp 


f ( 

13 max < 


1 1<1<X-1 








<2Rxx. 



Of course this also increases the time required to reach the end of the 
molecule, but we can imagine that the process can be stopped once the 

precision for the site x is reached. The proof to get the above expression is 
very close to the previous one so we do not give any details. 

A last remark, the prediction depends on the rest of the unknown se- 
quence due to the presence of i/Px ■ We can imagine an extreme case 
where the forces fi{z) = fili<:,<^_i + fi1z=x + fK+i^z>x+i from i = 1 
to K are applied, which means that the molecule can not be split after the 
base X. In this case we would have P-almost surely for any sequence gQ{x), 



liix) ~ R-^oo ^ ^ 5o(x)|X«, g^oJii-),i < K)) 



(29) 



so at least asymptotically we get a lower bound for 1/Rc{x) which is inde- 
pendent of ^0 aiid exponentially increasing in p. 

Wc conclude with a discussion about the link between the energie and 
the sequence of bases. First let us recall the table of the binding free energies 
for DNA at room temperature: 



50 


A 


T 


C 


G 


A 


1.78 


1.55 


2.52 


2.22 


T 


1.06 


1.78 


2.28 


2.54 


C 


2.54 


2.22 


3.14 


3.85 


G 


2.28 


2.52 


3.90 


3.14 
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Notice that the largest free energies which correspond to the most stable 
links are on the bottom right end corner of the table, in fact the largest 
binding energy is obtained when a G is followed by a C. Notice also that 
go{G,G) = go{C,C) so we can not distinguish these two different links 
by looking only at the free energy. In the same way the lowest free en- 
ergy is produced by bases T and A followed by the same letters, once 
again go{A, A) = (70 (T,T). For the rest of the table we have the equal- 
ity goiW, S) = goiS, W), where S is either a C or a G and W a A oi aT, S 
(respectively W) is the complementary of S (respectively of W). 

Moreover it is possible to reconstruct the DNA molecule from the com- 
patible binding energies only if there is only one sequence of base pairs which 
corresponds to the sequence of energies (see Theorem 12 .Sp . This is not al- 
ways the case, for example when the molecule repeats the same scheme: 
the energy of C — C — • • • — C is equal to the energy of G — G — ■ ■ ■ — G, 
in the same way A — G — A — G — — A — G has the same energy than 
G — T — G — T — -- - — G — T. Notice that if these highly improbable sequences 
are broken only once in the molecule then we turn back to a solvable case. 
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